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ABSTRACT 



A new information- theoretically justified approach to 
missing data estimation for multivariate categorical data was studied. The 
approach is a model -based imputation procedure relative to a model class 
(i.e., a functional form for the probability distribution of the complete 
data matrix) , which in this case is the set of multinomial models with some 
independence assumptions. Based on the given model class assumption, an 
inf ormation- theoretic criterion can be derived to select between the 
different complete data matrices. Intuitively this general criterion, called 
stochastic complexity, represents the shortest code length needed for coding 
the complete data matrix relative to the model class chosen. Using these 
inf ormation- theoretic criteria, the missing data problem is reduced to a 
search problem, that of finding the data completion with minimal stochastic 
complexity. The results of two empirical studies of the approach, using 
educational data sets of 478 elementary school students ("Popular kids" - 
POPKIDS in Michigan) and 500 Irish schoolchildren ("Irish educational 
transitions data - Irish) , are presented and compared to those achieved with 
commonly used techniques such as case deletion and imputation of sample 
averages. (Contains 3 figures, 6 tables, and 36 references.) (Author/SLD) 
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In this paper we study a new in formation-theoretical ly justified approach to missing data estimation 
for multivariate categorical data. The approach discussed is a model-based imputation procedure 
relative to a model class (i.e., a functional form for the probability distribution of the complete data 
matrix), which in our case is the set of multinomial models with some independence assumptions. 
Based on the given model class assumption an information-theoretic criterion can be derived to select 
between the different complete data matrices. Intuitively this general criterion, called stochastic com- 
plexity, represents the shortest code length needed for coding the complete data matrix relative to the 
model class chosen. Using this information-theoretic criteria, the missing data problem is reduced 
to a search problem, i.e., finding the data completion with minimal stochastic complexity. In the 
experimental part of the paper we present empirical results of the approach using two real data sets, 
and compare these results to those achived by commonly used techniques such as case deletion and 
imputating sample averages. 



Introduction 

In most educational research contexts the available data 
are typically incomplete and contain usually several ele- 
ments with missing information. Missing elements are par- 
ticularly typical to data from complex questionnaire based 
surveys, where the lack of time or low motivation of the 
respondents result in neglection of many of the questions. 
In most cases omission of incomplete data, i.e., concentra- 
tion on only records with complete data, is infeasible, as the 
amount of data for the analysis would be drastically reduced. 
Therefore intelligent methods for handling missing data are 
an important aspect of quantitative data analysis. 

The problem of missing data estimation has been ad- 
dressed widely in the statistics literature (see e.g., (Gelman, 
Carlin, Stern, & Rubin, 1995; Rubin, 1987, 1996; Schafer, 
1995)). The last quarter of a century has seen many de- 
velopments in this area. The EM algorithm together with 
its extensions (Dempster, Laird, & Rubin, 1977; McLachlan 
& Thriyambakam, 1997), multiple imputation (Rubin, 1987, 
1996; Schafer, 1995) and Markov Chain Monte Carlo (Gilks, 
Richardson, & J., 1996) all provide tools for inference in 
large classes of missing data problems. In practice, however, 
these developments have not had large impact on the way 
most data analysts handle missing values on a routine ba- 
sis. This is partly due to the rather complex nature of these 
approaches, but mostly because of their lack of support in 
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statistical software. 

The purpose of this paper is to study a new information- 
theoretically justified approach to missing data estimation. 
The method discussed is deeply related to Bayesian infer- 
ence, but originates from the research on universal cod- 
ing (Rissanen, 1984), which aims at finding good (short) en- 
codings of data. We do not make an attempt to provide a 
survey of the aforementioned developments in missing data 
estimation — an interested reader can consult the excellent 
books by Little and Rubin (1987) and Schafer (1997). In or- 
der to put our work in perspective, however, we would like to 
remind that the proposed methods can essentially be catego- 
rized into two general approaches: case deletion and imputa- 
tion (Little & Rubin, 1987; Schafer, 1997). In case deletion 
all the cases with missing data are omitted and the analysis is 
performed only using the complete cases. Obviously this ap- 
proach is a reasonable solution only if the incomplete cases 
comprise a small fraction of all cases. In imputation-based 
procedures the missing data values are filled with plausi- 
ble values which forces the incomplete data set into a com- 
plete data format. The methods in this group vary from sim- 
ple sample average imputation approaches (Little & Rubin, 
1987) to complex multiple imputation procedures (Schafer, 
1997). The latter share the same underlying philosophy 
as EM and data augmentation: an incomplete-data prob- 
lem is solved by repeatedly solving the complete-data ver- 
sion. In multiple imputation the unknown missing data are 
replaced by several “simulated” values using Monte Carlo 
approaches, and each of the resulting complete data sets is 
analyzed by standard complete data methods. The resulting 
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variability is then taken to reflect the uncertainty caused by 
the missing data. 

The approach discussed here can be characterized as 
a model-based imputation procedure. Of the existing ap- 
proaches, the method described here is somewhat related 
to Bayesianly proper multiple imputation (Schafer, 1997), 
which uses independent realizations of the posterior predic- 
tive distribution of the missing data under some complete- 
data model and prior. The method discussed here is similar 
in the sense that it is always relative to a model class, and 
the criterion used for finding the values to be imputed can be 
approximated by the Bayesian marginal likelihood (Berger, 
1985; Bernardo & Smith, 1994; O’Hagan, 1994). However, 
we are interested in the problem of finding a single optimal 
completion of the incomplete data set instead of a set of com- 
pletions typical to multiple imputation procedures. More- 
over, as we will see in the next section, the augmentation 
criterion has its foundations in information and coding the- 
ory (Cover & Thomas, 1991) rather than Bayesian statistics. 

Intuitively the approach can be described as follows. By 
modeling the set of data records as a matrix of incomplete 
discrete data, the missing part is estimated by assuming a 
functional form for the probability distribution of the data 
cases (i.e., a model class). Based on the given model class 
assumption an information-theoretic criteria can be derived 
to select between the different complete data matrices for the 
more “likely” one (in abstract sense). Intuitively this general 
criteria, called stochastic complexity (Rissanen, 1987, 1989, 
1996) represents the shortest code length needed for cod- 
ing the complete data matrix relative to the model class cho- 
sen. Unfortunately in general the exact criteria is very hard 
to compute for many interesting model families, but it can 
be approximated by the Bayesian marginal likelihood com- 
puted by integrating over all the possible models (parameter 
settings) in the chosen model class. 

Since we are interested in categorical data, we use the 
set of (saturated) multinomial models for the complete data, 
which is a more general descriptive “language” than the mul- 
tivariate normal class in the sense that it allows also for 
higher than two-way associations among the variables. In 
addition to selecting the multinomial model class additional 
independence assumptions for the variables have to be made 
to make the approach feasible in practice. This leads us to 
consider a special subclass of finite mixture models (Titter- 
ington. Smith, & Makov, 1985) known as Naive Bayes mod- 
els. 

Having defined an evaluation criterion for our data com- 
pletions, the missing data problem is reduced to a search 
problem, where the goal is to find a data completion that min- 
imizes the stochastic complexity of the completed matrix. 
Due to the large discrete search space exhaustive search for 
the minimal stochastic complexity completion among all the 
possible completions is not feasible for data sets with large 
fractions of missing data. However, locally optimal solutions 
can be found by using stochastic search methods such as EM 



and simulated annealing (Aarts & Korst, 1989). In this pa- 
per for the experiments we use a simple easy-to-implement 
variant called stochastic greedy, which has a comparable per- 
formance to the more complex search methods (Kontkanen, 
Myllymaki, Silander, & Tirri, 1997a). 

In the experimental part of the paper we present empir- 
ical results of the approach using two real data sets, and 
compare these results to those achieved by commonly used 
techniques such as case deletion and imputating sample av- 
erages. It should be observed that albeit we discuss the finite 
mixture model class, the approach presented is general and 
applicable to imputation of categorical data with other model 
classes also. 

The missing data problem 

We will consider rectangular data sets whose rows can be 
modeled as independent, identically distributed (iid) draws 
from some multivariate probability distribution. The rows 
represent observational units and the columns represent vari- 
ables recorded for those units. In the following such rows 
of the matrix are called data vectors d, and data set D is 
defined as a set of N data vectors d\ 9 ... 9 d^. Each com- 
plete data vector d consists of m -f 1 value assignments, 
d — {X i ,X m +i =x„,+i), where each value*; is as- 
sumed to belong to the discrete set {jc/i , . . . Conse- 

quently, a complete data set D can be regarded asaA/xm-t*l 
matrix, where each component dji is a value assignment of 
the form < X; = jc/ >. In the incomplete data case, one or 
more of these assignments are initially unknown. In the se- 
quel we partition the matrix D into two sets of components, 
D — (A>bs» Anis)* where D 0 b S denotes the constant compo- 
nents which are originally given (the observed data), and 
Anis the missing components which are to be estimated by 
using D 0 b S . The missing data estimation task is to augment 
the missing values, i.e., assign values to elements in Anis* in 
optimal manner with respect to the inference tasks for which 
the data is to be used. Here we restrict ourselves to predic- 
tive inference tasks (Bernardo & Smith, 1994; Gelman et al., 
1995), i.e., we aim at developing augmentation methods that 
produce completions which result in good predictive perfor- 
mance, when the completed data is used to build a predictive 
model. 

Completion criteria: Stochastic 
complexity 

As discussed earlier, the approach adopted here is based 
on “scoring” alternative completions of the missing data 
Anis based on an information-theoretic criterion. This cri- 
terion can be derived from the Minimum Description Length 
(MDL) Principle developed by Rissanen (1989,1996). Ac- 
cording to MDL, the goal of all (inductive) inference from 
data is to compress the given data as much as possible, i.e., 
to describe it using as few bits as possible. Intuitively such 
an argumentation can be justified by the fact that to compress 
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data, one needs to find out regularities in it. The more we are 
able to compress the data, the more regularities (e.g., depen- 
dencies) we have found. Thus to be able to find the shortest 
encoding of data, we need to have extracted all the existing 
regularities. These regularities can then be used to character- 
ize the underlying process generating the data, which is the 
purpose of modeling in the first place. 

Data compression involves the use of a description 
method or code , which is a one-one mapping from datasets 
to their descriptions. Without loss of generality, these de- 
scriptions may be taken to be binary strings (Rissanen, 
1989). Intuitively, the shorter the description or codelength 
of a set of D, the more regular or simpler the set D is. Ris- 
sanen (1987) defines the stochastic complexity informally as 
follows: 

The stochastic complexity of the data set D with 
respect to the model class M is the shortest code 
length of D obtainable when the encoding is 
done with the help of class M (Rissanen, 1987, 

1996). 

Here “with the help of” has a clear intuitive meaning: if 
there exists a model in fyVf which captures the regularities 
in D well, or equivalently gives a good fit to D, then the 
code length of D should be short. However, it turns out to be 
very hard to define “with the help of” in a formal manner . 
Indeed, a completely satisfactory formal definition has only 
been found very recently (Rissanen, 1996). 

Note that the informal definition of stochastic complex- 
ity (SC) as given above presumes the existence of a code: by 
definition, the SC of a data set D is the length of the encoding 
of D where the encoding is done using some special code C* 
which gives “the shortest possible codelengths with respect 
to In order to introduce a formula for the codelengths 
obtained using this C*, the connection between probability 
distributions and codes has to be first clarified. 

In general, we denote the length (in bits) of the encoding 
of D when the encoding is done using a code C by Lc(D). All 
codes considered in MDL are prefix codes (Rissanen, 1989). 
From the Kraft inequality (see for example (Rissanen, 1989) 
or (Cover & Thomas, 1991)) it follows that for every prefix 
code C, there exists a corresponding probability distribution 
P such that for all data sets D of given length N (i.e., with 
N data instantiations), we have — log P(D) = Lc(D) 1 . Sim- 
ilarly, for every probability distribution P defined over all 
data sets D of length N there exists a code C such that for all 
datasets D of length /V, we have Lc{D) = [— logP(D)] (here 
\x] is the smallest integer greater or equal to x). If we use 
— logP(D) instead of [— logP(D)], our code lengths will 
always be less than one bit off the mark; we may therefore 
safely neglect the integer requirement for code lengths (Ris- 
sanen, 1987). Once we have done this, the two facts above 



throughout this paper, by “log” we denote logarithm to the 
base two. 



imply that we can interpret any probability distribution over 
sequences of a given length as a code and vice versa. This 
correspondence allows us to identify codes and probability 
distributions: every probability distribution P over data sets 
of length N may equivalently be interpreted as determining 
a code C such that Lc(D) = - log P(D) for all D of length 
N. We see that a short code length corresponds to a high 
probability and vice versa: whenever P(D) > P(D f ), we have 
— logP(D) < — logP(D'). 

If our parametric class of models is regular enough 
(as it will indeed be for all instantiations of fyVf we consider 
in this paper), then there exists a maximum likelihood (ML) 
estimator © for every data set D, and we can write: 

©(D) = arg max P(D\Q) 

&£!M 

= arg min — logP(D|@) 

&€!M 

= arg min L(D|0), (1) 

where the last equality indicates the fact that each 0 defines a 
code such that the code length of D is given by — logP(D|@). 
Since this term can be interpreted as a code length, we denote 
it by L(D|0). 

Let us now consider a data set D of arbitrary but fixed 
length N. The MDL Principle tells us to look for a short 
encoding of D. The model within class fyVf that compresses 
the data most is the ML model ©(D), since by (1) it is the 
model for which L(D|0), the codelength of D when en- 
coded with (the code corresponding to) ©, is lowest. At 
first sight it seems that we should code our data D us- 
ing ©(D), in which case the MDL Principle would reduce 
to the maximum likelihood method of classical statistics. 
However — and this is the crucial observation which makes 
MDL very different from maximum likelihood principle — 
MDL says that we must code our data using some fixed 
code, which compresses all data sets for which there is a 
good-fitting model in (Rissanen, 1987). But the code 
corresponding to ©(D), i.e., the code that encodes any D' 
using L(D'|©(D)) = — logP(D'|©(D)) bits, only gives opti- 
mal compression for some data sets (including D). For most 
other data sets D' ^ D, ©(D) will definitely not be optimal: 
if we had been given such a different data set D' (also of 
length N) instead of D, then the code corresponding to 0(D') 
rather than ©(D) would give us the optimal compression. In 
general, coding D ' using ©(D) (i.e., using L(D'|©(D)) bits) 
may be very inefficient. 

As discussed above, MDL says that we must code our 
data using some fixed code, which compresses all data sets 
that are well modeled by 5Vf. We can therefore not use the 
code based on ©(D) if our data happens to be D and the code 
based on 0(D') if our data happens to be O': we would then 
encode D using a different code than when encoding D 7 . It 
would thus be very desirable if we could come up with a code 
that compresses each possible D as well as the maximum- 
likelihood, or equivalently, mostly-compressing element in 
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for that specific D. In other words, we would like to have 
a single code Cj such that Lq x (D) = L(D|0(D)) for all pos- 
sible D. However, such a code cannot exist as soon as our 
model class contains more than one element, since in gen- 
eral a code can only give short codelengths to a very limited 
number of data instantiations. Nevertheless, it is possible to 
construct a code C 2 such that 

Lc 2 (D) = - log/>(D|0(D)) + K n = L(D|0(D)) + K N (2) 

for all D of length N . Here K /v is a constant that may depend 
on N but is equal for all D of length N. If, for some 0 E 
we say that it fits the data D well, we mean that the probabil- 
ity F(D|0) is high. Note that the code length obtained using 
C 2 precisely reflects for each D how well D is fitted by the 
model in the class that fits D best. 

Picking C 2 such that the constant Kn is as small as pos- 
sible yields the most efficient code that satisfies (2). We call 
the resulting code the stochastic complexity code and denote 
it by C*. The corresponding minimal K v is denoted by 
and is called the model cost of Consequently we are 
finally ready to give the formal definition of the stochastic 
complexity: the code length of D when encoded using the C* 
code is the stochastic complexity of D with respect to model 
class which we write as SC{D\%{): 

SC(D\M) = Lc*(D) 

= L(D|0(D)) + K* n where 0(D) E 7tf(3) 

In many situations the model classes are such that (3) 
cannot be easily calculated. Fortunately there exist several 
good approximations to SC (Rissanen, 1989, 1996). One 
good approximation is based on the equivalence between 
codes and probability distributions discussed earlier. As ar- 
gued before, we can map code C* to a probability distribu- 
tion P* such that for all D, — logP*(D) = SC. We saw that 
C* can be seen as the code giving the shortest code lengths 
with respect to I3T , similarly P * can be seen as the probability 
distribution giving “as much probability as possible” to those 
data sets for which there is a good model in <M . . A good can- 
didate for such a distribution is the Bayesian marginal like- 
lihood P{D\ < M) (Bernardo & Smith, 1994), also sometimes 
known as the evidence , which can be computed by integrat- 
ing over all possible models (parameter settings) 0 , 

P{D\M) = />(D 0 bs,Anis \M) 

= J P(D obS! D mis |!^ ! 0)P(©|!^) d©. (4) 

As discussed in (Rissanen, 1996) for many model classes (4) 
approximates SC{D\ < M ) extremely well, and thus in the se- 
quel we will use this approximation as the “pragmatic” defi- 
nition of stochastic complexity. 

Stochastic complexity for Naive 
Bayes models 

Since we are interested in categorical data, we use the 
set of (saturated) multinomial models for the complete data. 



which is a more general descriptive “language” than the mul- 
tivariate normal class in the sense that it allows also for 
higher than two-way associations among the variables. In 
addition to selecting the multinomial model class some in- 
dependence assumptions for the variables have to be made 
to make the stochastic complexity approach feasible in prac- 
tice. This leads us to consider a special subclass of finite 
mixture models (Titterington et al., 1985) known as Naive 
Bayes models. For Naive Bayes model class, the categori- 
cal variables X/,i ^ s are assumed to be independent, given 
the values of a specific observed variable X s often called the 
class variable . For notational convenience we index these 
independent variables from 1,... ,m. From this assumption 
it follows that the joint probability distribution for a data vec- 
tor d can be written as 

P(d) = P(X\ =x h ... ,X m =x m ,X s = k) 

m 

= P(X S = k) fl P(X = = k) . (5) 

1=1 

Consequently, in the Naive Bayes model case, distribution 
P(d) can be uniquely determined by fixing the values of the 
parameters 0 = (a,d>), 

a = (oti , . . . ,a K ), and 

O = (4>n,... ,4>i 

where the value of parameter a* gives the probability 
P(X S = k ) and 

= > • • • j §kirij ) 3 

where ^7 = P{X t = xu \X S = k ) . 

Here m denotes the number of possible values for variable X/, 
and K the number of values for variable X s . Using these de- 
notations, we can now write 

m 

P(d) — P{X\ — *l/j » • • • iX m — X m i m ,X S ~ k) = CLfc * 

i=l 

In the following we assume that a* > 0 and fail > 0 
for all &,/, and /. Furthermore, both the variable distribu- 
tion P(X S ) and the conditional distributions P(Xj\X s = k) 
are multinomial, i.e., X s ~ Multi(l;(Xi ,... , 0 ^), and X^ k ~ 
Multi ( 1 ; 9 - .. 3 tykitii ) • Since the family of Dirichlet densi- 
ties is conjugate (see e.g., (DeGroot, 1970)) to the family 
of multinomials, it is convenient to assume that the prior 
distributions of the parameters are from this family (see, 
e.g., (Heckerman, Geiger, & Chickering, 1995)). More pre- 
cisely, let 

(ai,... ,<x*) ~ ,Pk) , and 

(fel > • • • 3 ^ Bi (Oki\ j • • • 3 Okini) j 

where {pk^kii | k = 1, . . . ,n c \i = 1 ,... ,m;/ =1,... ,/*/} are 
the hyperparameters of the corresponding distributions. For 
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Figure 1. The Expectation- Maximization algorithm for stochastic 
complexity minimization. 

Algorithm 1 

Expectation- Maximization ( EM ) 

1. Set t — 0. Initialize parameters ©M randomly. 

2. E-step: Determine 

2 ( 0 , © (,) ) = 

E[log P(D obs , |0, M)P(Q\M) |D obs ,©<'>, M\ , 

where are the parameter estimates in time step t. 

3. M-step: Set = argmax{<2(0,0^)|0 C Q)}. 

e 

4. Set t = t + 1. Goto 2 if not converged. 



simplicity, we will use here the uniform prior for the param- 
eters, so all fife and o k n are set to 1. A more detailed discus- 
sion on the priors can be found in (Kontkanen, Myllymaki, 
Silander, Tirri, & Griinwald, 1998, 1997). 

As shown in (Cooper & Herskovits, 1992; Heckerman 
et al., 1995), with the above assumptions the posterior proba- 
bility of complete data (D ob s,Dmis) for a Naive Bayes model 
where n c is the number of values for X s is 

^(^obsj^mis | M/if) 

= J /’(Dobs, A™ I 0,M„ c )P(Q I M„ c ) d © 

_ r (Xm ^ n r) j-T r(hk + m k ) /z-x 

r{N+r k i^k)L\ m) 

rrrr / /=i ° k n) rr ^(fkii + Qkii) \ 

UU\ r (** + iai®«i)iU nota) )' 

Computing the stochastic complexity measure for the in- 
complete data matrix requires marginalizing out the missing 
data Dmis, i.e., 

^C(D 0 bs | M t , r ) = — logP(D 0 b S | M r , r ) 

= —log ^ ^(DobsjDn^s | M /lf ), 

^inis 

where P(D 0 b S ,Anis \ M tlc ) is given by ( 6 ), and the (expo- 
nential) sum goes over all the possible assignments of the 
missing data elements. 

Search methods 

Due to the exponential number of differenf completions 
of D m i S , for real data sets we cannot calculate SC for all 
possible completions. In general we have several alterna- 
tive approaches for searching the data matrix completion 
with the minimal stochastic complexity. One possibility is 
to use a variant of the Expectation- Maximization (EM) algo- 
rithm (Dempster et al., 1977), which consists of two abstract 



Figure 2. The Simulated Annealing-algorithm for stochastic com- 
plexity minimization. 

Algorithm 2 

Simulated Annealing (SA) 

1. Generate an initial random guess D m j S of the miss- 
ing data. Set the temperature parameter T to its initial 
value. 

2. Repeat L times 

(a) Generate a new candidate D m j S for the missing data 

by changing the estimate of one randomly chosen 
missing data item in to a randomly chosen 
value. 

(b) If /^(Dobs, ^mis j -^0 ^ F(Z) 0 bs 5 ^mis | ^0 then 

set D m i S = D m i s . 

(c> E,se if ( steads ) 1/r > Rand ° m (°> 0 then 

set D m i s = D m j s . 

3. T = F * T. If not converged, goto 2. 



steps, Expectation(E) and Maximization(M), presented in 
Figure 1. 

One should observe that the EM algorithm does not pro- 
vide an estimate of the missing data directly. The EM al- 
gorithm maximizes P(Q\D 0 b Sy fM), and the resulting candi- 
date for maximum posterior probability model 0 can then be 
used for estimating the missing data D mis . However, it can 
be shown that the stochastic complexity can be approximated 
by 

SC(D obs , DmisIfW) « logP(D obs ,D mis |0,^)P(0|^)-C 5 

(7) 

where C is a constant depending on the number of the model 
parameters, and the number of the data vectors (Schwarz, 
1978; Rissanen, 1989). As the expectation of the first term 
of this approximation is maximized during the EM process, 
it can be argued that the EM optimizes the stochastic com- 
plexity indirectly by optimizing the approximation ( 7 ). 

An alternative is to use simulated annealing 
(SA) (Metropolis, Rosenbluth, Rosenbluth, Teller, & 
Teller, 1953; Kirkpatrick, Gelatt, & Vecchi, 1983), a 
stochastic global optimization method belonging to the 
family of Markov Chain Monte Carlo (MCMC) stochastic 
simulation algorithms. A commonly used version of SA 
goes is given in Figure 2. In this scheme, the cooling 
factor F is a constant parameter smaller than one. The 
SA algorithm converges as the temperature T approaches 
zero. It can be shown that if the initial temperature is high 
enough, and the decrement of the parameter is done slowly 
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enough, the process converges to the global optimum almost 
surely (Aarts & Korst, 1989). 

Finally, by the stochastic greedy (SG) algorithm we mean 
a simple procedure where new solution candidates are gen- 
erated as in simulated annealing, but the candidates are ac- 
cepted only if the marginal likelihood is increased. Due to 
its simplicity and good performance, this stochastic greedy 
approach was used in the experimental part of the paper. 

Experiments 

Data 

The use of the stochastic complexity based imputation 
will be illustrated using two different educational data sets 
containing categorical variables. 

Subjects of the first “Popular kids” data set (POPKIDS) 
were students in grades 4-6 from three school districts in In- 
gham and Clinton Counties, Michigan. The data consists of 
478 students were selected from urban, suburban, and rural 
school districts. In the collected questionnaires students in- 
dicated whether good grades, athletic ability, or popularity 
was most important to them. They also ranked four factors: 
grades, sports, looks, and money, in order of their impor- 
tance for popularity. The questionnaire also asked for gen- 
der, grade level, and other demographic information. The 
study involved a classification task of correctly identifying 
the one of the six schools using the other variables as predic- 
tors. 

The second data set was “Irish educational transitions 
data” (IRISH) (Greaney & Kelleghan, 1984) reanalyzed 
by Raftery and Hout (1993). Subjects of this data set 
were 500 Irish schoolchildren aged 11 in 1967. The data 
were also used, in a simplified form, as an example to 
illustrate Bayesian model selection methods by Kass and 
Raftery (1994). The data had 6 variables, and the classifi- 
cation task was to predict the educational level (11 levels) of 
the students. 

Experimental setting 

In order to validate our approach, we produced synthetic 
missing data problems from the above real data sets by ran- 
domly deleting a known portion of the data. In the experi- 
ments also the sample sizes were controlled. Based on the 
different completions of we performed a classification 
analysis, i.e., used the completed data sets to solve a classi- 
fication problem in order to study the practical implications 
of different procedures for handling missing data. 

For each data set, we created three different subsam- 
ples of sizes 10% 25% and 50% of the original data set 
size. Each time 50% of the data was reserved for the sub- 
sequent out-of-sample classification analysis. In each data 
sample we deleted (completely at random) 5%, 10%, 20%, 
35% and 50% of the elements thus creating artificial missing 
data problems satisfying the missing completely at random 



Figure 3. The value of the stochastic complexity measure (y- 
axis) of the imputed POPKIDS data matrix (N=239) as a func- 
tion of the missing data percentage (x-axis). Different lines indi- 
cate the different imputation schemes (ORG=the original complete 
data matrix, AVG=imputing averages, RND=imputing by random, 
SC=imputing by minimizing the stochastic complexity). 




5 10 15 20 25 30 35 40 45 50 



(MCAR) assumption (Schafer, 1997). We then created com- 
plete data matrices by imputing the missing values using two 
alternative techniques. The first technique (AVG) was sim- 
ply to impute the averages in the observed data. The second 
technique (SC) was to use a simple greedy algorithm where 
missing values were first imputed by random values (RND) 
and then one by one replaced by the values that minimize 
the stochastic complexity of the data. Since the missing data 
situation was artificially created, we were able to also use 
(as a reference point) the “oracle method” of always imput- 
ing the original value (ORG). These two techniques (AVG 
and SC) together with the two extreme reference techniques 
(RND and ORG) were then evaluated using both the stochas- 
tic complexity measure, and the percentage of correctly im- 
puted values (correctness was judged by comparing the re- 
sults to the original data). For each data set the setting de- 
scribed above was created 100 times and the averages were 
computed. 

In order to study the practical implications qf the differ- 
ent imputation schemes with different sample sizes and dif- 
ferent percentages of missing data, all the imputations were 
used to build a model (classification rule) which was then 
used to classify previously unseen data items (train and test 
scheme). In this second set of experiments the result ob- 
tained by imputation schemes were further contrasted with 
the naive policy of building the model only using complete 
data items, i.e., those left intact by missing data generation. 
This policy corresponds to the case deletion methods widely 
used in practice. Again for each data set the setting described 
above was created 100 times and the averages were com- 
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Figure 4. The success rate (y-axis) of recovering the origi- 
nal complete POPKIDS data matrix (N=239) as a function of 
of the missing data percentage (x-axis). Different lines indi- 
cate the different imputation schemes (ORG=the original complete 
data matrix, AVG=imputing averages, RND=imputing by random, 
SC=imputing by minimizing the stochastic complexity). 




puted. 

Results 

In Figure 3 we can see a typical example of the stochastic 
complexity measure for the completed data as a function of 
the different missing data percentages (POPKIDS data set). 
It is worth noticing that while only the SC-scheme attempts 
to minimize the stochastic complexity, also imputing sam- 
ple averages (AVG) has the effect of minimizing complexity. 
When comparing the Figures 3 and 4 the percentage of cor- 
rectly imputed values can be seen to follow the (negative of) 
stochastic complexity measure and is thus consistent with 
the theoretical analysis. In addition it should be observed 
that with respect to the average imputation approach the dif- 
ference of recovering the original complete set is more than 
15% on the average, and about 20% for missing data per- 
centages less than 25%. 

Comparing performance in classification also shows the 
beneficial effect of SC-based imputation (see Figure 5). The 
results also clearly demonstrate the inferior performance of 
the case deletion approach for prediction tasks, from prag- 
matic point of view the other commonly used technique, im- 
putation of sample averages, is a significantly better alterna- 
tive. However, imputing completely at random (RND) seems 
to yield a surprisingly good classification results. This is due 
to the fact that if missing completely at random assumption 
holds even approximately, imputation of random values does 
not bias the model construction. 

The more detailed results for both data sets with all sam- 
ple sizes, missing data percentages, and imputation methods 



Figure 5. The classification accuracy (y-axis) as a function of 
of the missing data percentage (x-axis) when classifying 239 out- 
of-sample data vectors in the POPKIDS data set. Different lines 
indicate the performance of the models constructed from the com- 
plete data matrices obtained by different missing data handling 
schemes (ORG=the original complete data matrix, AVG=imputing 
averages, RND=imputing by random, SC=imputing by minimizing 
the stochastic complexity, DEL=case deletion). 




are listed in Tables 1-6. 

Summary and discussion 

We have studied the problem of missing data estimation, 
and proposed a new information-theoretically justified ap- 
proach to for incomplete multivariate categorical data. The 
approach discussed is a model-based imputation procedure 
relative to a model class, which in our case is the set of 
multinomial models with some independence assumptions. 
Based on the given model class assumption an information- 
theoretic criteria can be derived to select between the differ- 
ent complete data matrices. Thus the completion problem 
in this approach can be reduced to a search problem, i.e., 
finding the data completion with minimal criterion value. 

As demonstrated by the empirical results, the stochastic 
complexity based approach performs well both in recovering 
the original missing data. More importantly, it also succeeds 
in augmenting the incomplete data matrix in such a manner, 
that in a subsequent classification task the complete data can 
be used to build a model that predicts better than the models 
built from complete data matrices produced by alternative 
methods. From practitioners point of view this latter aspect 
is more interesting, as completion of the incomplete data ma- 
trix is usually only an intermediate stage to applying various 
analysis tasks. 

It should be observed that the approach adopted is very 
generic: changing the model class to a more descriptive 
one (e.g., to general graphical models such as Bayesian net- 
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Table i 

The value of the stochastic complexity measure of the im- 
puted POPKIDS data matrix with different sample sizes 
as a function of the missing data percentage (M%). Dif- 
ferent columns indicate the different imputation schemes 
(ORG-the original complete data matrix , AVG= imputing 
averages , RND=imputing by random, SC=imputing by min- 
imizing the stochastic complexity). 



Size 


M% 


AVG 


RND 


sc 




5 


638.83 


640.72 


631.23 




10 


641.07 


646.05 


627.29 


47 


20 


639.57 


653.97 


620.56 




35 


624.41 


662.24 


610.69 




50 


588.79 


667.62 


599.39 




5 


1554.06 


1563.32 


1524.90 




10 


1563.74 


1587.13 


1513.33 


119 


20 


1560.23 


1623.03 


1488.72 




35 


1509.18 


1661.02 


1452.33 




50 


1399.21 


1684.87 


1414.90 




5 


2999.84 


3027.69 


2928.01 




10 


3025.04 


3089.80 


2899.97 


239 


20 


3026.34 


3194.43 


2841.77 




35 


2915.83 


3304.16 


2747.40 




50 


2679.93 


3369.29 


2649.96 



Table 2 

The success rate of recovering the original complete 
POPKIDS data matrix with different sample sizes as a 
function of the missing data percentage (M%). Dif- 
ferent columns indicate the different imputation schemes 
( ORG-the original complete data matrix , AVG=imputing 
averages , RND=imputing by random, SC— imputing by min- 
imizing the stochastic complexity). 



Size 


M% 


AVG 


RND 


sc 




5 


34.76 


29.12 


53.04 




10 


34.98 


29.37 


52.12 


47 


20 


35.32 


29.95 


47.55 




35 


35.96 


29.63 


43.34 




50 


35.55 


29.60 


38.22 




5 


35.68 


29.88 


57.34 




10 


35.88 


29.42 


54.97 


119 


20 


35.87 


29.92 


51.68 




35 


35.95 


29.98 


47.64 




50 


36.12 


29.83 


42.14 




5 


36.50 


29.95 


57.67 




10 


35.78 


30.26 


56.27 


239 


20 


35.89 


29.89 


54.02 




35 


35.71 


29.92 


49.61 




50 


35.78 


29.67 


45.06 



Table 3 

The classification accuracy as a function of of the miss- 
ing data percentage (M%) when classifying 239 out-of- 
sample data vectors in the POPKIDS data set. Differ- 
ent columns indicate the performance of the models con- 
structed from the complete data matrices (sizes 47, 119 and 
239) obtained by different missing data handling schemes 
(ORG=the original complete data matrix, AVG=imputing 
averages, RND=imputing by random, SC=imputing by min- 
imizing the stochastic complexity, DEL-case deletion). 



Size 


M% 


AVG 


RND 


sc 


DEL 




5 


46.39 


46.87 


46.03 


41.68 




10 


43.04 


44.12 


44.24 


33.97 


47 


20 


34.81 


39.82 


39.85 


24.03 




35 


23.78 


32.04 


32.05 


14.99 




50 


17.75 


23.75 


25.07 


14.23 




5 


60.67 


61.15 


61.13 


55.21 




10 


56.49 


58.43 


58.26 


46.27 


119 


20 


46.00 


52.30 


52.85 


31.10 




35 


31.90 


42.81 


44.70 


16.85 




50 


25.58 


30.10 


33.82 


14.38 




5 


70.08 


70.86 


71.13 


66.47 




10 


66.07 


68.10 


69.11 


57.13 


239 


20 


56.59 


61.83 


63.29 


37.21 




35 


42.02 


51.92 


53.67 


18.23 




50 


33.81 


39.60 


42.93 


14.45 



Table 4 

The value of the stochastic complexity measure of the im- 
puted IRISH data matrix with different sample sizes as 
a function of the missing data percentage (M%). Dif- 
ferent columns indicate the different imputation schemes 
(ORG=the original complete data matrix, AVG=imputing 
averages, RND= imputing by random, SC=imputing by min- 
imizing the stochastic complexity). 



Size 


M% 


AVG 


RND 


sc 




5 


380.38 


381.33 


373.58 




10 


383.15 


386.50 


371.37 


50 


20 


384.56 


394.21 


367.10 




35 


376.72 


401.27 


360.97 




50 


355.28 


406.76 


355.29 




5 


901.80 


907.08 


877.15 




10 


912.56 


927.24 


870.25 


125 


20 


917.87 


958.75 


856.93 




35 


893.00 


992.44 


840.91 




50 


829.52 


1013.40 


820.93 




5 


1729.24 


1746.24 


1670.96 




10 


1755.35 


1799.88 


1656.73 


250 


20 


1765.85 


1880.21 


1627.49 




35 


1708.64 


1963.74 


1584.08 




50 


1572.05 


2016.08 


1538.59 
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Table 5 

The success rate of recovering the original complete IRISH 
data matrix with different sample sizes as a function of the 
missing data percentage (M%). Different columns indicate 
the different imputation schemes (ORG=the original com- 
plete data matrix, AVG=imputing averages , RND=imputing 
by random, SC= imputing by minimizing the stochastic com- 
plexity). 



Size 


M % 


AVG 


RND 


SC 




5 


36.73 


29.87 


57.60 




10 


36.47 


29.80 


57.23 


50 


20 


35.90 


31.72 


54.82 




35 


35.38 


31.15 


48.55 




50 


34.78 


29.80 


42.52 




5 


37.59 


29.89 


62.51 




10 


37.56 


29.81 


59.37 


125 


20 


37.05 


30.49 


56.79 




35 


37.21 


29.89 


51.44 




50 


36.73 


30.34 


44.85 




5 


36.85 


30.57 


62.13 




10 


37.08 


30.03 


60.37 


250 


20 


37.58 


30.49 


57.44 




35 


37.26 


30.76 


52.13 




50 


37.11 


30.56 


46.69 



Table 6 

The classification accuracy as a function of of the missing 
data percentage (M%) when classifying 250 out-of-sample 
data vectors in the IRISH data set . Different columns in- 
dicate the performance of the models constructed from the 
complete data matrices (sizes 50, 125 and 250) obtained 
by different missing data handling schemes (ORG=the 
original complete data matrix, AVG= imputing averages, 
RND= imputing by random, SC=imputing by minimizing the 
stochastic complexity, DEL=case deletion). 



Size 


M% 


AVG 


RND 


SC 


DEL 




5 


53.11 


53.21 


53.45 


52.58 




10 


50.30 


51.58 


52.04 


50.08 


50 


20 


43.14 


49.19 


50.30 


44.56 




35 


27.84 


43.75 


46.26 


30.24 




50 


17.72 


33.69 


38.06 


10.79 




5 


59.50 


59.68 


60.24 


59.26 




10 


57.26 


58.66 


59.35 


57.05 


125 


20 


49.02 


55.40 


57.31 


50.34 




35 


33.04 


49.66 


52.84 


41.24 




50 


20.70 


41.18 


44.80 


22.62 




5 


61.80 


61.84 


61.96 


61.79 




10 


60.34 


61.47 


61.80 


61.06 


250 


20 


53.80 


59.86 


60.74 


56.91 




35 


40.49 


54.74 


57.87 


46.85 




50 


27.01 


48.19 


51.40 


30.80 



works (Jensen, 1996; Pearl, 1988)) allows better “compres- 
sion” of the data, i.e., better completions can be made. It 
is important to notice that in this sense the information- 
theoretic approach to modeling based on compression is akin 
to Bayesian modeling, which also always is relative to a set 
of models. Naturally missing data estimation is only one 
particular application of the MDL-approach, for other in- 
teresting applications such as model selection, time series 
analysis and predictive modeling the literature on minimum 
encoding modeling approaches should be consulted (see for 
example (Baxter & Oliver, 1994; Kontkanen, Myllymaki, Si- 
lander, & Tirri, 1997b; Rissanen, 1987, 1989; Wallace & 
Freeman, 1987) and references therein). 
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